{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b699e295",
   "metadata": {},
   "source": [
    "# Mean encoding - simple\n",
    "\n",
    "[Feature Engineering for Time Series Forecasting](https://www.trainindata.com/p/feature-engineering-for-forecasting)\n",
    "\n",
    "In this notebook, we will encode static features with mean encoding. We will split the data into train and test sets, learn the mean target value per category using the train set, and then encode both the train and test sets with those learned parameters.\n",
    "\n",
    "It has the advantage that this logic is implemented by open-source libraries.\n",
    "\n",
    "The drawback is that we may overfit because we are leaking future data into the past. \n",
    "\n",
    "We will use the online retail dataset, which we prepared in the notebook `02-create-online-retail-II-datasets.ipynb` located in the `01-Create-Datasets` folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "49b2f0bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from feature_engine.encoding import MeanEncoder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a174f3b",
   "metadata": {},
   "source": [
    "## Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "67a2af74",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>quantity</th>\n",
       "      <th>revenue</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>week</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2009-12-06</th>\n",
       "      <td>Belgium</td>\n",
       "      <td>143</td>\n",
       "      <td>439.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-13</th>\n",
       "      <td>Belgium</td>\n",
       "      <td>10</td>\n",
       "      <td>8.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-20</th>\n",
       "      <td>Belgium</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-27</th>\n",
       "      <td>Belgium</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-03</th>\n",
       "      <td>Belgium</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            country  quantity  revenue\n",
       "week                                  \n",
       "2009-12-06  Belgium       143    439.1\n",
       "2009-12-13  Belgium        10      8.5\n",
       "2009-12-20  Belgium         0      0.0\n",
       "2009-12-27  Belgium         0      0.0\n",
       "2010-01-03  Belgium         0      0.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"../Datasets/online_retail_dataset_countries.csv\",\n",
    "                parse_dates=[\"week\"],\n",
    "                index_col=\"week\",\n",
    "                )\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a419d6a",
   "metadata": {},
   "source": [
    "## Split into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1f4c0763",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the data before and after June 2011\n",
    "\n",
    "X_train = df[df.index <= pd.to_datetime('2011-06-30')]\n",
    "X_test = df[df.index > pd.to_datetime('2011-06-30')]\n",
    "\n",
    "y_train = X_train[\"revenue\"]\n",
    "y_test = X_test[\"revenue\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "928be034",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Timestamp('2009-12-06 00:00:00'), Timestamp('2011-06-26 00:00:00'))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sanity check\n",
    "\n",
    "X_train.index.min(), X_train.index.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6e838b49",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Timestamp('2011-07-03 00:00:00'), Timestamp('2011-12-11 00:00:00'))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sanity check\n",
    "\n",
    "X_test.index.min(), X_test.index.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5de7aa0",
   "metadata": {},
   "source": [
    "## Encode"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2402ebb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set up the mean encoder\n",
    "\n",
    "enc = MeanEncoder()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "74ef4a1a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>MeanEncoder()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">MeanEncoder</label><div class=\"sk-toggleable__content\"><pre>MeanEncoder()</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "MeanEncoder()"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Find mean target value per category\n",
    "# (it uses the entire train set)\n",
    "\n",
    "enc.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1667b70c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['country']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Feature-engine's encoder finds categorical variables\n",
    "# by default\n",
    "\n",
    "enc.variables_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "90a34078",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'country': {'Belgium': 511.37853658536585,\n",
       "  'EIRE': 5579.161829268293,\n",
       "  'France': 2872.7475609756098,\n",
       "  'Germany': 3764.180012195122,\n",
       "  'Spain': 919.3335365853659,\n",
       "  'United Kingdom': 129124.83931707316}}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the encoding values\n",
    "\n",
    "enc.encoder_dict_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "2c4cf198",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>quantity</th>\n",
       "      <th>revenue</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>week</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2009-12-06</th>\n",
       "      <td>511.378537</td>\n",
       "      <td>143</td>\n",
       "      <td>439.1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-13</th>\n",
       "      <td>511.378537</td>\n",
       "      <td>10</td>\n",
       "      <td>8.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-20</th>\n",
       "      <td>511.378537</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-27</th>\n",
       "      <td>511.378537</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-03</th>\n",
       "      <td>511.378537</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               country  quantity  revenue\n",
       "week                                     \n",
       "2009-12-06  511.378537       143    439.1\n",
       "2009-12-13  511.378537        10      8.5\n",
       "2009-12-20  511.378537         0      0.0\n",
       "2009-12-27  511.378537         0      0.0\n",
       "2010-01-03  511.378537         0      0.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Encode datasets\n",
    "\n",
    "X_train_t = enc.transform(X_train)\n",
    "X_test_t = enc.transform(X_test)\n",
    "\n",
    "X_train_t.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85599ce7",
   "metadata": {},
   "source": [
    "Note that Belgium was replaced by 511.37 in all rows, even though on various occasions the revenue was 0. This may result in a \"look ahead\" bias."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60a6c207",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "fsml",
   "language": "python",
   "name": "fsml"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "165px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
